Abstract


A large set of signals can sometimes be described sparsely using a dictionary, that is, every element 
can be represented as a linear combination of few elements from the dictionary. Algorithms for 
various signal processing applications, including classification, denoising and signal separation, 
learn a dictionary from a given set of signals to be represented. Can we expect that the error 
in representing by such a dictionary a previously unseen signal from the same source will be of 
similar magnitude as those for the given examples? We assume signals are generated from a fixed 
distribution, and study these questions from a statistical learning theory perspective. 


We develop generalization bounds on the quality of the learned dictionary for two types of
constraints on the coefficient selection, as measured by the expected $L_2$ error in representation
when the dictionary is used. For the case of $l_1$ regularized coefficient selection we provide a
generalization bound of the order of $O\left(\sqrt{np\ln(m\lambda)/m}\right)$, where n is the
dimension, p is the number of elements in the dictionary, $\lambda$ is a bound on the $l_1$ norm of
the coefficient vector and m is the number of samples, which complements existing results. For the
case of representing a new signal as a combination of at most k dictionary elements, we provide a
bound of the order $O\left(\sqrt{np\ln(mk)/m}\right)$ under an assumption on the closeness to
orthogonality of the dictionary (low Babel function). We further show that this assumption holds for
most dictionaries in high dimensions in a strong probabilistic sense. Our results also include
bounds that converge as 1/m, not previously known for this problem. We provide similar results in a
general setting using kernels with weak smoothness requirements.


Keywords: dictionary learning, generalization bound, sparse representation 


1. Introduction 


A common technique in processing signals from $\mathcal{X} = \mathbb{R}^n$ is to use sparse
representations; that is, to approximate each signal x by a “small” linear combination a of elements
$d_i$ from a dictionary $D \in \mathcal{X}^p$, so that $x \approx Da = \sum_{i=1}^{p} a_i d_i$. This
has various uses detailed in Section 1.1. The smallness of a is often measured using either
$\|a\|_1$, or the number of non zero elements in a, often denoted $\|a\|_0$. The approximation error
is measured here using a Euclidean norm appropriate to the vector space.

We denote the approximation error of x using dictionary D and coefficients from a set A by

$$h_{A,D}(x) = \min_{a \in A} \|Da - x\|, \tag{1}$$

where A is one of the following sets determining the sparsity required of the representation:

$$H_k = \{a : \|a\|_0 \le k\}$$

induces a “hard” sparsity constraint, which we also call k sparse representation, while

$$R_\lambda = \{a : \|a\|_1 \le \lambda\}$$

induces a convex constraint that is considered a “relaxation” of the previous constraint.
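
For small problems, the error in (1) under the hard constraint can be evaluated by brute force. The sketch below is an illustration only and not an algorithm from this paper: it assumes the dictionary is given as a list of column vectors, and the helper names `residual_norm` and `k_sparse_error` are hypothetical. It tries every support of size k and measures the distance from x to the span of the chosen columns, so its cost grows combinatorially in p and k.

```python
import itertools
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual_norm(x, atoms):
    """Euclidean distance from x to the span of `atoms` (Gram-Schmidt)."""
    basis = []
    for d in atoms:
        r = list(d)
        for q in basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        if norm > 1e-12:  # skip atoms that are already in the span
            basis.append([ri / norm for ri in r])
    r = list(x)
    for q in basis:
        c = dot(r, q)
        r = [ri - c * qi for ri, qi in zip(r, q)]
    return math.sqrt(dot(r, r))


def k_sparse_error(x, dictionary, k):
    """Brute-force value of h_{H_k,D}(x): best error using at most k columns."""
    p = len(dictionary)
    size = min(k, p)  # supports of exactly this size cover all smaller ones
    return min(
        residual_norm(x, [dictionary[i] for i in support])
        for support in itertools.combinations(range(p), size)
    )
```

As a usage example, take the atoms e1, e2 and (e1 + e2)/√2 in three dimensions. The signal (0.6, 0.8, 0) then has 1-sparse error √0.02 ≈ 0.141, attained by the third atom, and 2-sparse error 0.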
The dictionary learning problem is to find a dictionary D minimizing

$$E(D) = E_{x \sim \nu} h_{A,D}(x), \tag{2}$$

where $\nu$ is a distribution over signals that is known to us only through samples from it. The
problem addressed in this paper is the “generalization” (in the statistical learning sense) of dictionary
learning: to what extent does the performance of a dictionary chosen based on a finite set of sam- 
ples indicate its expected error in (2)? This clearly depends on the number of samples and other 
parameters of the problem such as the dictionary size. In particular, an obvious algorithm is to 
represent each sample using itself, if the dictionary is allowed to be as large as the sample, but the 
performance on unseen signals is likely to disappoint. 

To state our goal more quantitatively, assume that an algorithm finds a dictionary D suited to k
sparse representation, in the sense that the average representation error $E_m(D)$ on the m examples
given to the algorithm is low. Our goal is to bound the generalization error $\varepsilon$, which is
the additional expected error that might be incurred:

$$E(D) \le (1+\eta) E_m(D) + \varepsilon, \tag{3}$$

where $\eta \ge 0$ is sometimes zero, and the bound $\varepsilon$ depends on the number of samples and
problem parameters. Since efficient algorithms that find the optimal dictionary for a given set of samples (also
known as empirical risk minimization, or ERM, algorithms) are not known for dictionary learning, 
we prove uniform convergence bounds that apply simultaneously over all admissible dictionaries D, 
thus bounding from above the sample complexity of the dictionary learning problem. In particular, 
such a result means that every procedure for approximate minimization of empirical error (empirical 
dictionary learning) is also a procedure for approximate dictionary learning (as defined above) in a 
similar sense. 

Many analytic and algorithmic methods relying on the properties of finite dimensional Euclidean 
geometry can be applied in more general settings by applying kernel methods. These consist of 
treating objects that are not naturally represented in $\mathbb{R}^n$ as having their similarity described by
an inner product in an abstract feature space that is Euclidean. This allows the application of 
algorithms depending on the data only through a computation of inner products to such diverse 
objects as graphs, DNA sequences and text documents (Shawe-Taylor and Cristianini, 2004). Is 
it possible to extend the usefulness of dictionary learning techniques to this setting? We address 
sample complexity aspects of this question as well. 
1.1 Background and Related Work


Sparse representations are by now standard practice in diverse fields such as signal processing, 
natural language processing, etc. Typically, the dictionary is assumed to be known. The motivation 
for sparse representations is indicated by the following results, in which we assume the signals come
from $\mathcal{X} = \mathbb{R}^n$, are normalized to have length 1, and the representation coefficients
are constrained to $A = H_k$ where $k \ll n, p$ and typically $h_{A,D}(x) \ll 1$.


• Compression: If a signal x has an approximate sparse representation in some commonly
known dictionary D, it can be stored or transmitted more economically with reasonable
precision. Finding a good sparse representation can be computationally hard but if D fulfills
certain geometric conditions, then its sparse representation is unique and can be found
efficiently (see, e.g., Bruckstein et al., 2009).

• Denoising: If a signal x has a sparse representation in some known dictionary D, and
$\tilde{x} = x + \nu$, where the random noise $\nu$ is Gaussian, then the sparse representation
found for $\tilde{x}$ will likely be very close to x (for example Chen et al., 2001).

• Compressed sensing: Assuming that a signal x has a sparse representation in some known
dictionary D that fulfills certain geometric conditions, this representation can be approximately
retrieved with high probability from a small number of random linear measurements of x. The
number of measurements needed depends on the sparsity of x in D (Candes and Tao, 2006).


The implications of these results are significant when a dictionary D is known that sparsely rep- 
resents simultaneously many signals. In some applications the dictionary is chosen based on prior 
knowledge, but in many applications the dictionary is learned based on a finite set of examples. To 
motivate dictionary learning, consider an image representation used for compression or denoising. 
Different types of images may have different properties (MRI images are not similar to scenery 
images), so that learning a dictionary specific to each type of images may lead to improved perfor- 
mance. The benefits of dictionary learning have been demonstrated in many applications (Protter 
and Elad, 2007; Peyré, 2009). 

Two extensively used techniques related to dictionary learning are Principal Component Anal- 
ysis (PCA) and K-means clustering. The former finds a single subspace minimizing the sum of 
squared representation errors, which is very similar to dictionary learning with $A = H_k$ and p = k.
The latter finds a set of locations minimizing the sum of squared distances between each signal and
the location closest to it, which is very similar to dictionary learning with $A = H_1$ where p is the
number of locations. Thus we could see dictionary learning as PCA with multiple subspaces, or as 
clustering where multiple locations are used to represent each signal. The sample complexities of 
both algorithms are well studied (Bartlett et al., 1998; Biau et al., 2008; Shawe-Taylor et al., 2005; 
Blanchard et al., 2007). 

This paper does not address questions of computational cost, though they are very relevant. 
Finding optimal coefficients for k sparse representation (that is, minimizing (1) with $A = H_k$) is
NP-hard in general (Davis et al., 1997). Dictionary learning as the optimization problem of minimizing
(2) is less well understood, even for empirical $\nu$ (consisting of a finite number of samples), despite
over a decade of work on related algorithms with good empirical results (Olshausen and Field, 1997; 
Lewicki et al., 1998; Kreutz-Delgado et al., 2003; Aharon et al., 2006; Lee et al., 2007; Mairal et al., 
2010). 
The only prior work we are aware of that addresses generalization in dictionary learning, by
Maurer and Pontil (2010), addresses the convex representation constraint $A = R_\lambda$; we discuss the
relation of our work to theirs in Section 2. 


2. Results 


Except where we state otherwise, we assume signals are generated in the unit sphere $S^{n-1}$. Our
results are: 

A new approach to dictionary learning generalization. Our first main contribution is an ap- 
proach to generalization bounds in dictionary learning that is complementary to the approach used 
by Maurer and Pontil (2010). The previous result, given below in Theorem 6, has generalization
error bounds (the $\varepsilon$ of inequality (3)) of order

$$O\left(\sqrt{p \min(p,m) \left(\lambda + \sqrt{\ln(m\lambda)}\right)^2 / m}\right)$$


on the squared representation error. A notable feature of this result is the weak dependence on the
signal dimension n. In Theorem 1 we quantify the complexity of the class of functions $h_{A,D}$ over
all dictionaries whose columns have unit length, where $A \subset R_\lambda$. Combined with standard
methods of uniform convergence this results in generalization error bounds $\varepsilon$ of order
$O\left(\sqrt{np\ln(m\lambda)/m}\right)$ when $\eta = 0$. While our bound does depend strongly on n,
this is acceptable in the case n < p, also known in the literature as the “over-complete” case
(Olshausen and Field, 1997; Lewicki et al., 1998). Note that our generalization bound applies with
different constants to the representation error itself and many variants including the squared
representation error, and has a weak dependence on $\lambda$. The dependence on $\lambda$ is
significant, for example, when $\|a\|_1$ is used as a weighted penalty term by solving
$\min_a \|Da - x\| + \gamma \cdot \|a\|_1$; in this case $\lambda = O(\gamma^{-1})$ may be quite large.
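
The relation between the penalty weight and the resulting bound on the coefficient norm follows from a one-line argument, sketched here under the standing assumption that the signal has unit norm. Write $a^*$ for a minimizer of the penalized objective (notation introduced only for this sketch) and compare it with the feasible choice $a = 0$:

```latex
\gamma \|a^*\|_1
  \;\le\; \|Da^* - x\| + \gamma \|a^*\|_1
  \;\le\; \|D \cdot 0 - x\| + \gamma \|0\|_1
  \;=\; \|x\| \;=\; 1
\quad\Longrightarrow\quad
\|a^*\|_1 \;\le\; \gamma^{-1}.
```

So a small penalty weight permits a large $l_1$ norm, which is why a weak (logarithmic) dependence of the bound on that norm matters.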

Fast rates. For the case $\eta > 0$ our methods allow bounds of order $O(np\ln(\lambda m)/m)$. The main
significance of this is that the general statistical behavior they imply occurs in dictionary
learning. For example, generalization error has a “proportional” component which is reduced when the
empirical error is low. Whether fast rate results can be proved in the dimension free regime is an
interesting question we leave open. Note that due to lower bounds by Bartlett et al. (1998) of order
$\sqrt{m^{-1}}$ on the k-means clustering problem, which corresponds to dictionary learning for 1-sparse
representation, fast rates may be expected only with $\eta > 0$, as presented here.

We now describe the relevant function class and the bounds on its complexity, which are proved 
in Section 3. The resulting generalization bounds are given explicitly at the end of this section. 


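Theorem 1 For every $\varepsilon > 0$, the function class

$$G_\lambda = \left\{h_{R_\lambda,D} : S^{n-1} \to \mathbb{R} \;:\; D \in \mathbb{R}^{n \times p},\ \|d_i\| \le 1\right\},$$

taken as a metric space with the distance induced by $\|\cdot\|_\infty$, has a subset of cardinality at most
$(4\lambda/\varepsilon)^{np}$, such that every element from the class is at distance at most $\varepsilon$ from the subset.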


While we give formal definitions in Section 3, such a subset is called an $\varepsilon$ cover, and such a
bound on its cardinality is called a covering number bound. 

Extension to k sparse representation. Our second main contribution is to extend both our ap- 
proach and that of Maurer and Pontil (2010) to provide generalization bounds for dictionaries for k 
sparse representations, by using a bound $\lambda$ on the $l_1$ norm of the representation coefficients when the
dictionaries are close to orthogonal. Distance from orthogonality is measured by the Babel function 
(which, for example, upper bounds the magnitude of the maximal inner product between distinct 
dictionary elements) defined below and discussed in more detail in Section 4. 


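Definition 2 (Babel function, Tropp 2004) For any $k \in \mathbb{N}$, the Babel function
$\mu_k : \mathbb{R}^{n \times p} \to \mathbb{R}^+$ is defined by:

$$\mu_k(D) = \max_{i \in \{1,\ldots,p\}} \; \max_{\Lambda \subset \{1,\ldots,p\} \setminus \{i\};\ |\Lambda| = k} \; \sum_{j \in \Lambda} |\langle d_j, d_i \rangle|.$$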
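
The Babel function is straightforward to compute directly. The following is a minimal sketch, assuming Tropp's definition: for each dictionary element, sum the k largest absolute inner products with the other elements, then take the maximum over elements. The dictionary is assumed to be a list of column vectors, and the function name `babel` is hypothetical.

```python
def babel(dictionary, k):
    """Babel function mu_k(D) for a dictionary given as a list of columns."""
    worst = 0.0
    for i, d_i in enumerate(dictionary):
        # absolute inner products of d_i with every other column, largest first
        overlaps = sorted(
            (abs(sum(a * b for a, b in zip(d_i, d_j)))
             for j, d_j in enumerate(dictionary) if j != i),
            reverse=True,
        )
        # the maximizing index set of size k consists of the k largest overlaps
        worst = max(worst, sum(overlaps[:k]))
    return worst
```

With k = 1 this returns the largest absolute inner product between distinct elements, and for an orthonormal dictionary it returns 0 for every k.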


The following proposition, which is proved in Section 3, bounds the 1-norm of the dictionary 
coefficients for a k sparse representation and also follows from analysis previously done by Donoho 
and Elad (2003) and Tropp (2004). 


Proposition 3 Let each column $d_i$ of D fulfill $\|d_i\| \in [1,\gamma]$ and $\mu_{k-1}(D) \le \delta < 1$,
then a coefficient vector $a \in \mathbb{R}^p$ minimizing the k-sparse representation error
$h_{H_k,D}(x)$ exists which has $\|a\|_1 \le \gamma k/(1-\delta)$.


We now consider the class of all k sparse representation error functions. We prove in Section 3 
the following bound on the complexity of this class. 


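Corollary 4 The function class

$$F_{\delta,k} = \left\{h_{H_k,D} : S^{n-1} \to \mathbb{R} \;:\; \mu_{k-1}(D) < \delta,\ d_i \in S^{n-1}\right\},$$

taken as a metric space with the metric induced by $\|\cdot\|_\infty$, has a covering number bound of at most
$\left(4k/(\varepsilon(1-\delta))\right)^{np}$.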


The dependence of the last two results on $\mu_{k-1}(D)$ means that the resulting bounds will be
meaningful only for algorithms which explicitly or implicitly prefer near orthogonal dictionaries. 
Contrast this to Theorem 1 which does not require significant conditions on the dictionary. 

Asymptotically almost all dictionaries are near orthogonal. A question that arises is what values 
of $\mu_{k-1}$ can be expected for parameters n, p, k? We shed some light on this question through the
following probabilistic result, which we discuss in Section 4 and prove in Appendix B. 


Theorem 5 Suppose that D consists of p vectors chosen uniformly and independently from $S^{n-1}$.
Then we have

$$P\left(\mu_k > \delta\right) \le \sqrt{\frac{\pi}{2}}\, p(p-1) \exp\left(-\frac{(n-2)(\delta/k)^2}{2}\right).$$


Since low values of the Babel function have implications to representation finding algorithms, 
this result is of interest also outside the context of dictionary learning. Essentially it means that 
random dictionaries whose cardinality is sub-exponential in (n — 2) /k? have low Babel function. 

New generalization bounds for the $l_1$ case. The covering number bound of Theorem 1 implies several
generalization bounds for the problem of dictionary learning for $l_1$ regularized representations
which we give here. These differ from bounds by Maurer and Pontil (2010) in depending more
strongly on the dimension of the space, but less strongly on the particular regularization term. We
first give the relevant specialization of the result by Maurer and Pontil (2010) for comparison and
for reference as we will later build on it. This result is independent of the dimension n of the
underlying space, thus the Euclidean unit ball B may be that of a general Hilbert space, and the errors
measured by $h_{A,D}$ are in the same norm.


Theorem 6 (Maurer and Pontil 2010) Let $A \subset R_\lambda$, and let $\nu$ be any distribution on the unit sphere
B. Then with probability at least $1 - e^{-x}$ over the m samples in $E_m$ drawn according to $\nu$, for all
dictionaries $D \subset B$ with cardinality p:

$$E h^2_{A,D} \le E_m h^2_{A,D} + \sqrt{\frac{p^2\left(14\lambda + \frac{1}{2}\sqrt{\ln(16 m \lambda^2)}\right)^2}{m}} + \sqrt{\frac{x}{2m}}.$$


Using the covering number bound of Theorem 1 and a bounded differences concentration in- 
equality (see Lemma 21), we obtain the following result. The details are given in Section 3. 


Theorem 7 Let $\lambda > e/4$, with $\nu$ a distribution on $S^{n-1}$. Then with probability at least $1 - e^{-x}$ over
the m samples in $E_m$ drawn according to $\nu$, for all D with unit length columns:

$$E h_{R_\lambda,D} \le E_m h_{R_\lambda,D} + \sqrt{\frac{np\ln(4\sqrt{m}\lambda)}{2m}} + \sqrt{\frac{x}{2m}} + \sqrt{\frac{4}{m}}.$$


Using the same covering number bound and the general result Corollary 23 (given in Section 
3), we obtain the following fast rates result. A slightly more general result is easily derived by using 
Proposition 22 instead. 





Theorem 8 Let $\lambda > e/4$, $np \ge 20$ and $m \ge 5000$ with $\nu$ a distribution on $S^{n-1}$. Then with probability
at least $1 - e^{-x}$ over the m samples in $E_m$ drawn according to $\nu$, for all D with unit length columns:

$$E h_{R_\lambda,D} \le 1.1\, E_m h_{R_\lambda,D} + 9\,\frac{np\ln(4\lambda m) + x}{m}.$$

Note that the absolute loss $h_{R_\lambda,D}$ in the new bounds can be replaced with the quadratic loss $h^2_{R_\lambda,D}$
used in Theorem 6, at a small cost: an added factor of 2 inside the ln, and the same applies to many
other loss functions. This applies also to the cover number based bounds given below. 

Generalization bounds for k sparse representation. Proposition 3 and Corollary 4 imply certain 
generalization bounds for the problem of dictionary learning for k sparse representations, which we 
give here. 

A straightforward combination of Theorem 2 of Maurer and Pontil (2010) (given here as
Theorem 6) and Proposition 3 results in the following theorem.


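Theorem 9 Let $\delta < 1$ with $\nu$ a distribution on $S^{n-1}$. Then with probability at least $1 - e^{-x}$ over the
m samples in $E_m$ drawn according to $\nu$, for all D s.t. $\mu_{k-1}(D) \le \delta$ and with unit length columns:

$$E h^2_{H_k,D} \le E_m h^2_{H_k,D} + \frac{p}{\sqrt{m}}\left(\frac{14k}{1-\delta} + \frac{1}{2}\sqrt{\ln\left(16 m \left(\frac{k}{1-\delta}\right)^2\right)}\right) + \sqrt{\frac{x}{2m}}.$$
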
In the case of clustering we have k = 1 and $\delta = 0$ and this result approaches the rates of Biau
et al. (2008). 

The following theorems follow from the covering number bound of Corollary 4 and applying 
the general results of Section 3 as for the $l_1$ sparsity results.


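Theorem 10 Let $\delta < 1$ with $\nu$ a distribution on $S^{n-1}$. Then with probability at least $1 - e^{-x}$ over
the m samples in $E_m$ drawn according to $\nu$, for all D s.t. $\mu_{k-1}(D) \le \delta$ and with unit length columns:

$$E h_{H_k,D} \le E_m h_{H_k,D} + \sqrt{\frac{np\ln\left(\frac{4\sqrt{m}k}{1-\delta}\right)}{2m}} + \sqrt{\frac{x}{2m}} + \sqrt{\frac{4}{m}}.$$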


Theorem 11 Let $\delta < 1$, $np \ge 20$ and $m \ge 5000$ with $\nu$ a distribution on $S^{n-1}$. Then with probability
at least $1 - e^{-x}$ over the m samples in $E_m$ drawn according to $\nu$, for all D s.t. $\mu_{k-1}(D) \le \delta$ and with
unit length columns:

$$E h_{H_k,D} \le 1.1\, E_m h_{H_k,D} + 9\,\frac{np\ln\left(\frac{4mk}{1-\delta}\right) + x}{m}.$$





Generalization bounds for dictionary learning in feature spaces. We further consider applica- 
tions of dictionary learning to signals that are not represented as elements in a vector space, or that 
have a very high (possibly infinite) dimension. 

In addition to providing an approximate reconstruction of signals, sparse representation can also 
be considered as a form of analysis, if we treat the choice of non zero coefficients and their magni- 
tude as features of the signal. In the domain of images, this has been used to perform classification 
(in particular, face recognition) by Wright et al. (2008). Such analysis does not require that the data 
itself be represented in $\mathbb{R}^n$ (or in any vector space); it is enough that the similarity between data
elements is induced from an inner product in a feature space. This requirement is fulfilled by using 
an appropriate kernel function. 


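Definition 12 Let $\mathcal{R}$ be a set of data representations, then a kernel function
$\kappa : \mathcal{R}^2 \to \mathbb{R}$ and a feature mapping $\varphi : \mathcal{R} \to \mathcal{H}$ are such that:

$$\kappa(x,y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal{H}},$$

where $\mathcal{H}$ is some Hilbert space.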


As a concrete example, choose a sequence of n words, and let $\varphi$ map a document to the vector of
counts of appearances of each word in it (also called bag of words). Treating
$\kappa(a,b) = \langle \varphi(a), \varphi(b) \rangle$ as the similarity between documents a and b is
the well known “bag of words” approach, applicable to many document related tasks (Shawe-Taylor and
Cristianini, 2004). Then the statement $\varphi(a) + \varphi(b) \approx \varphi(c)$ does not imply
that c can be reconstructed from a and b, but we might consider
it indicative of the content of c. The dictionary of elements used for representation could be de- 
cided via dictionary learning, and it is natural to choose the dictionary so that the bags of words of 
documents are approximated well by small linear combinations of those in the dictionary. 
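
The bag of words example can be made concrete in a few lines. This is a minimal sketch, assuming whitespace tokenization and a fixed vocabulary list; the names `bag_of_words` and `kernel` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def bag_of_words(document, vocabulary):
    """Feature map: the vector of counts of each vocabulary word in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


def kernel(a, b, vocabulary):
    """Similarity of two documents: the inner product of their count vectors."""
    phi_a = bag_of_words(a, vocabulary)
    phi_b = bag_of_words(b, vocabulary)
    return sum(u * v for u, v in zip(phi_a, phi_b))
```

With the vocabulary `["sparse", "dictionary", "kernel"]`, the documents "sparse dictionary" and "kernel dictionary" have count vectors (1, 1, 0) and (0, 1, 1), which sum exactly to the count vector (1, 2, 1) of "sparse kernel dictionary dictionary". The word order of the third document is still not recoverable from the first two, which is the point made above.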

As the example above suggests, the kernel dictionary learning problem is to find a dictionary D
minimizing

$$E_{x \sim \nu} h_{\varphi,A,D}(x),$$

where we consider the representation error function

$$h_{\varphi,A,D}(x) = \min_{a \in A} \|(\Phi D)a - \varphi(x)\|_{\mathcal{H}},$$

in which $\Phi$ acts as $\varphi$ on the elements of D, $A \in \{R_\lambda, H_k\}$, and the norm
$\|\cdot\|_{\mathcal{H}}$ is that induced by the kernel on the feature space $\mathcal{H}$.

Analogues of all the generalization bounds mentioned so far can be replicated in the kernel 
setting. The dimension free results of Maurer and Pontil (2010) apply most naturally in this setting, 
and may be combined with our results to cover also dictionaries for k sparse representation, under 
reasonable assumptions on the kernel. 


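Proposition 13 Let $\nu$ be any distribution on $\mathcal{R}$ such that $x \sim \nu$ implies that $\varphi(x)$ is in the unit ball
$B_{\mathcal{H}}$ of $\mathcal{H}$ with probability 1. Then with probability at least $1 - e^{-x}$ over the m samples in $E_m$ drawn
according to $\nu$, for all $D \subset \mathcal{R}$ with cardinality p such that $\Phi D \subset B_{\mathcal{H}}$ and $\mu_{k-1}(\Phi D) \le \delta < 1$:

$$E h^2_{\varphi,H_k,D} \le E_m h^2_{\varphi,H_k,D} + \sqrt{\frac{p^2\left(\frac{14k}{1-\delta} + \frac{1}{2}\sqrt{\ln\left(16m\left(\frac{k}{1-\delta}\right)^2\right)}\right)^2}{m}} + \sqrt{\frac{x}{2m}}.$$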


Note that in $\mu_{k-1}(\Phi D)$ the Babel function is defined in terms of inner products in $\mathcal{H}$, and can
therefore be computed efficiently by applications of the kernel. 
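
A minimal sketch of this computation follows, using only kernel evaluations between dictionary elements. The Gaussian kernel is used as an assumed example because it gives every element unit norm in feature space; the function names and the `sigma` parameter are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel; kappa(x, x) = 1, so feature vectors have unit norm."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def kernel_babel(atoms, k, kernel):
    """Babel function of the mapped dictionary, from kernel values alone."""
    worst = 0.0
    for i, d_i in enumerate(atoms):
        # inner products in feature space are exactly the kernel values
        overlaps = sorted(
            (abs(kernel(d_i, d_j)) for j, d_j in enumerate(atoms) if j != i),
            reverse=True,
        )
        worst = max(worst, sum(overlaps[:k]))
    return worst
```

The cost is p(p − 1) kernel evaluations plus sorting, independent of the (possibly infinite) dimension of the feature space.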

In Section 5 we prove the above result and also cover number bounds as in the linear case 
considered before. In the current setting, these bounds depend on the Hölder smoothness order $\alpha$ of
the feature mapping $\varphi$. Formal definitions are given in Section 5 but as an example, the well known
Gaussian kernel has $\alpha = 1$. We give now one of the generalization bounds using this method.


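Theorem 14 Let $\mathcal{R}$ have $\varepsilon$ covers of order $(C/\varepsilon)^n$. Let $\kappa : \mathcal{R}^2 \to \mathbb{R}^+$ be a kernel function s.t.
$\kappa(x,y) = \langle \varphi(x), \varphi(y) \rangle$, for which $\varphi$ is uniformly L-Hölder of order $\alpha > 0$ over $\mathcal{R}$, and let
$\gamma = \max_{x \in \mathcal{R}} \|\varphi(x)\|_{\mathcal{H}}$. Let $\delta < 1$, and $\nu$ any distribution on $\mathcal{R}$, then with probability at least $1 - e^{-x}$
over the m samples in $E_m$ drawn according to $\nu$, for all dictionaries $D \subset \mathcal{R}$ of cardinality p s.t.
$\mu_{k-1}(\Phi D) \le \delta < 1$ (where $\Phi$ acts like $\varphi$ on columns):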








npln (vmc) x a 
+ bal — 
m 


Ehp, p < Emhu,p+Y Fre eee I 


The covering number bounds needed to prove this theorem and analogs for the other general- 
ization bounds are proved in Section 5. 


3. Covering Numbers of $G_\lambda$ and $F_{\delta,k}$


The main content of this section is the proof of Theorem 1 and Corollary 4. We also show that in
the k sparse representation setting a finite bound on $\lambda$ does not occur generally; thus an additional
restriction, such as the near-orthogonality on the set of dictionaries on which we rely in this setting,
is necessary. Lastly, we recall known results from statistical learning theory that link covering
numbers to generalization bounds.

We recall the definition of the covering numbers we wish to bound. Anthony and Bartlett (1999) 
give a textbook introduction to covering numbers and their application to generalization bounds. 


Definition 15 (Covering number) Let (M,d) be a metric space and $S \subset M$. Then the $\varepsilon$ covering
number of S defined as $N(\varepsilon,S,d) = \min\left\{|A| : A \subset M \text{ and } S \subset \bigcup_{a \in A} B_d(a,\varepsilon)\right\}$ is the size of the
minimal $\varepsilon$ cover of S using d.


To prove Theorem 1 and Corollary 4 we first note that the space of all possible dictionaries is a
subset of a unit ball in a Banach space of dimension np (with a norm specified below). Thus (see
formalization in Proposition 5 of Cucker and Smale, 2002) the space of dictionaries has an $\varepsilon$ cover
of size $(4/\varepsilon)^{np}$. We also note that a uniformly L Lipschitz mapping between metric spaces converts
$\varepsilon/L$ covers into $\varepsilon$ covers. Then it is enough to show that $\Psi_\lambda$ defined as $D \mapsto h_{R_\lambda,D}$ and $\Phi_k$ defined as
$D \mapsto h_{H_k,D}$ are uniformly Lipschitz (when $\Phi_k$ is restricted to the dictionaries with $\mu_{k-1}(D) \le c < 1$).
The proof of these Lipschitz properties is our next goal, in the form of Lemmas 18 and 19.

The first step is to be clear about the metrics we consider over the spaces of dictionaries and of 
error functions. 


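Definition 16 (Induced matrix norm) Let $p,q \ge 1$, then a matrix $A \in \mathbb{R}^{n \times m}$ can be considered as
an operator $A : (\mathbb{R}^m, \|\cdot\|_p) \to (\mathbb{R}^n, \|\cdot\|_q)$. The p,q induced norm is $\|A\|_{p,q} = \sup_{x \in \mathbb{R}^m,\ \|x\|_p = 1} \|Ax\|_q$.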


Lemma 17 For any matrix D, $\|D\|_{1,2}$ is equal to the maximal Euclidean norm of any column in D.


Proof That the maximal norm of a column bounds $\|D\|_{1,2}$ can be seen geometrically; $Da/\|a\|_1$ is a
convex combination of column vectors and their negations, then $\|Da\|_2 \le \max_i \|d_i\|_2 \|a\|_1$ because a norm is convex.
Equality is achieved for $a = e_i$, where $d_i$ is the column of maximal norm. ∎
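
Lemma 17 is easy to check numerically. The sketch below is a sanity check on a randomly generated dictionary, not part of the proof; the dictionary is assumed to be a list of columns, and the helper names `norm_1_2` and `matvec` are hypothetical.

```python
import math
import random


def norm_1_2(dictionary):
    """Induced (1,2) norm, computed as the largest Euclidean column norm."""
    return max(math.sqrt(sum(v * v for v in col)) for col in dictionary)


def matvec(dictionary, a):
    """Compute Da for a dictionary D given as a list of columns."""
    n = len(dictionary[0])
    return [sum(a_j * col[i] for a_j, col in zip(a, dictionary)) for i in range(n)]


rng = random.Random(0)
D = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]  # 6 columns in R^4
bound = norm_1_2(D)

ratios = []
for _ in range(1000):
    a = [rng.uniform(-1, 1) for _ in range(6)]
    l1 = sum(abs(v) for v in a)
    Da = matvec(D, a)
    # ratio of the Euclidean norm of Da to the l1 norm of a
    ratios.append(math.sqrt(sum(v * v for v in Da)) / l1)
```

Every sampled ratio stays below `bound`, and the bound is attained exactly when a is the standard basis vector selecting the longest column.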


The images of $\Psi_\lambda$ and $\Phi_k$ are sets of representation error functions; each dictionary induces
a set of precisely representable signals, and a representation error function is simply a map of
distances from this set. Representation error functions are clearly continuous, 1-Lipschitz, and into
[0, 1]. In this setting, a natural norm over the images is the supremum norm $\|\cdot\|_\infty$.


Lemma 18 The function $\Psi_\lambda$ is $\lambda$-Lipschitz from $(\mathbb{R}^{n \times p}, \|\cdot\|_{1,2})$ to $C(S^{n-1})$.


Proof Let D and D′ be two dictionaries whose corresponding elements are at most $\varepsilon > 0$ far from
one another. Let x be a unit signal and Da an optimal representation for it. Then
$\|(D - D')a\|_2 \le \|D - D'\|_{1,2} \|a\|_1 \le \varepsilon\lambda$. Since D′a is close to Da, it is in particular not a much
worse representation of x; replacing it with the optimal representation under D′, we have
$h_{R_\lambda,D'}(x) \le h_{R_\lambda,D}(x) + \varepsilon\lambda$. By symmetry we have $|\Psi_\lambda(D)(x) - \Psi_\lambda(D')(x)| \le \lambda\varepsilon$. This holds for
all unit signals, then $\|\Psi_\lambda(D) - \Psi_\lambda(D')\|_\infty \le \lambda\varepsilon$. ∎


This concludes the proof of Theorem 1. We now provide a proof for Proposition 3 which is used 
in the corresponding treatment for covering numbers under k sparsity. 
Proof (Of Proposition 3) Let $D^k$ be a submatrix of D whose k columns from D achieve the minimum
on $h_{H_k,D}(x)$ for $x \in S^{n-1}$. We now consider the Gram matrix $G = (D^k)^\top D^k$ whose diagonal entries
are the squared norms of the elements of $D^k$, therefore at least 1. By the Geršgorin theorem (Horn and
Johnson, 1990), each eigenvalue of a square matrix is “close” to a diagonal entry of the matrix; the
absolute difference between an eigenvalue and its diagonal entry is upper bounded by the sum of
the absolute values of the remaining entries of the same row. Since a row in G corresponds to the
inner products of an element from $D^k$ with every element from $D^k$, this sum is upper bounded by $\delta$
for all rows. Then we conclude the eigenvalues of the Gram matrix are lower bounded by $1 - \delta > 0$.
Then in particular G has a symmetric inverse $G^{-1}$ whose eigenvalues are positive and bounded from
above by $1/(1-\delta)$. The maximal magnitude of an eigenvalue of a symmetric matrix coincides with
its induced norm $\|\cdot\|_{2,2}$, therefore $\|G^{-1}\|_{2,2} \le 1/(1-\delta)$.


Linear dependence of elements of $D^k$ would imply a non-trivial nullspace for the invertible G.
Then the elements of $D^k$ are linearly independent, which implies that the unique optimal
representation of x as a linear combination of the columns of $D^k$ is $D^k a$ with

$$a = \left((D^k)^\top D^k\right)^{-1} (D^k)^\top x.$$


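Using the above and the definition of induced matrix norms, we have

$$\|a\|_2 \le \left\|\left((D^k)^\top D^k\right)^{-1}\right\|_{2,2} \left\|(D^k)^\top x\right\|_2 \le (1-\delta)^{-1} \left\|(D^k)^\top x\right\|_2.$$

The vector $(D^k)^\top x$ is in $\mathbb{R}^k$ and by the Cauchy Schwartz inequality $\langle d_i, x \rangle \le \gamma$, then
$\left\|(D^k)^\top x\right\|_2 \le \sqrt{k} \left\|(D^k)^\top x\right\|_\infty \le \sqrt{k}\gamma$. Since only k entries of a are non zero,
$\|a\|_1 \le \sqrt{k} \|a\|_2 \le k\gamma/(1-\delta)$. ∎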


Lemma 19 The function $\Phi_k$ is a $k/(1-\delta)$-Lipschitz mapping from the set of normalized
dictionaries with $\mu_{k-1}(D) < \delta$ with the metric induced by $\|\cdot\|_{1,2}$ to $C(S^{n-1})$.


The proof of this lemma is the same as that of Lemma 18, except that a is taken to be an 
optimal representation that fulfills $\|a\|_1 \le \lambda = k/(1 - \mu_{k-1}(D))$, whose existence is guaranteed by
Proposition 3. As outlined in the beginning of the current section, this concludes the proof of 
Corollary 4. 

The next theorem shows that unfortunately, $\Phi_k$ is not uniformly L-Lipschitz for any constant L,
requiring its restriction to an appropriate subset of the dictionaries. 


Theorem 20 For any 1 < k < n, p, there exists c > 0 and q, such that for every $\varepsilon > 0$, there exist
D, D′ such that $\|D - D'\|_{1,2} < \varepsilon$ but $|h_{H_k,D}(q) - h_{H_k,D'}(q)| > c$.


Proof First we show that for any dictionary D there exist c > 0 and $x \in S^{n-1}$ such that $h_{H_k,D}(x) > c$.
Let $\nu_{S^{n-1}}$ be the uniform probability measure on the sphere, and $A_c$ the probability assigned by it to
the set within c of a k dimensional subspace. As $c \searrow 0$, $A_c$ also tends to zero, then there exists c > 0
s.t. $\binom{p}{k} A_c < 1$. Then for that c and any dictionary D there exists a set of positive measure on which
$h_{H_k,D} > c$, let q be a point in this set. Since $h_{H_k,D}(x) = h_{H_k,D}(-x)$, we may assume without loss of
generality that $\langle e_k, q \rangle \ge 0$.

We now fix the dictionary D; its first k − 1 elements are the standard basis $\{e_1,\ldots,e_{k-1}\}$, its
kth element is $d_k = \sqrt{1 - \varepsilon^2/4}\, e_1 + \varepsilon e_k/2$, and the remaining elements are chosen arbitrarily. Now
construct D′ to be identical to D except its kth element is $v = \sqrt{1 - \varepsilon^2/4}\, e_1 + lq$, choosing l so that
$\|v\|_2 = 1$. Then there exist $a, b \in \mathbb{R}$ such that $q = aD'_1 + bD'_k$ and we have $h_{H_k,D'}(q) = 0$, fulfilling
the second part of the theorem. On the other hand, since $\langle e_k, q \rangle \ge 0$, we have $l \le \varepsilon/2$, and then we
find $\|D - D'\|_{1,2} = \|\varepsilon e_k/2 - lq\|_2 \le \|\varepsilon e_k/2\| + \|lq\| = \varepsilon/2 + l \le \varepsilon$. ∎


To conclude the generalization bounds of Theorems 7, 8, 10, 11 and 14 from the covering 
number bounds we have provided, we use the following results. Both specialize well known results 
to the case of $l_\infty$ cover number bounds, thereby improving constants and simplifying the proofs. The
first proof is simple enough that we include it at the end of this section. The second result¹ (along with
its corollary) gives fast rate bounds as in the more general results by Mendelson (2003) and Bartlett 
et al. (2005). 


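Lemma 21 Let $\mathcal{F}$ be a class of [0,B] functions with covering number bound $(C/\varepsilon)^d > e/B$ under
the supremum norm. Then for every x > 0, with probability of at least $1 - e^{-x}$ over the m samples
in $E_m$ chosen according to $\nu$, for all $f \in \mathcal{F}$:

$$E f \le E_m f + B\left(\sqrt{\frac{d\ln(C\sqrt{m})}{2m}} + \sqrt{\frac{x}{2m}}\right) + \sqrt{\frac{4}{m}}.$$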


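Proposition 22 Let $\mathcal{F}$ be a class of [0,1] functions that can be covered for any $\varepsilon > 0$ by at most
$(C/\varepsilon)^d$ balls of radius $\varepsilon$ in the $L_\infty$ metric where $C \ge e$ and $\beta > 0$. Then with probability at least
$1 - \exp(-x)$, we have for all $f \in \mathcal{F}$:

$$E f \le (1+\beta) E_m f + K(d,m,\beta)\,\frac{d\ln(Cm) + x}{m},$$

where $K(d,m,\beta) = \sqrt{2\left(\frac{9}{\sqrt{m}}+2\right)\left(\frac{d+3}{3d}\right)+1} + \left(\frac{9}{\sqrt{m}}+2\right)\left(\frac{d+3}{3d}\right) + 1 + \frac{1}{2\beta}$.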











The corollary we use to obtain Theorems 8 and 11 follows because $K(d,m,\beta)$ is non-increasing
in d, m.


Corollary 23 Let $\mathcal{F}$, x be as above. For $d \ge 20$, $m \ge 5000$ and $\beta = 0.1$ we have with probability at
least $1 - \exp(-x)$ for all $f \in \mathcal{F}$:

$$E f \le 1.1\, E_m f + 9\,\frac{d\ln(Cm) + x}{m}.$$





1. We thank Andreas Maurer for suggesting this result and a proof elaborated in Appendix A.

Proof (Of Lemma 21) We wish to bound $\sup_{f \in F} Ef - E_m f$. Take $F_\varepsilon$ to be a minimal $\varepsilon$ cover of
$F$, then for an arbitrary $f$, denoting by $f_\varepsilon$ an $\varepsilon$ close member of $F_\varepsilon$, $Ef - E_m f \le Ef_\varepsilon - E_m f_\varepsilon + 2\varepsilon$. In
particular, $\sup_{f \in F} Ef - E_m f \le 2\varepsilon + \sup_{f \in F_\varepsilon} Ef - E_m f$. To bound the supremum on the now finite
class of functions, note that $Ef - E_m f$ is an average of $m$ independent copies of the identical zero
mean bounded variable $Ef - E_1 f$.

Applying Hoeffding's inequality, we have $P(Ef - E_m f > t) \le \exp(-2mB^{-2}t^2)$.

The probability that any of the $|F_\varepsilon|$ differences under the supremum is larger than $t$ may be
bounded as $P(\sup_{f \in F_\varepsilon} Ef - E_m f \ge t) \le |F_\varepsilon| \cdot \exp(-2mB^{-2}t^2) \le \exp(d\ln(C/\varepsilon) - 2mB^{-2}t^2)$.

In order to control the probability with $x$ as in the statement of the lemma, we take $-x =
d\ln(C/\varepsilon) - 2mB^{-2}t^2$ or equivalently we choose $t = \sqrt{B^2/2m}\,\sqrt{d\ln(C/\varepsilon) + x}$. Then with probability
$1 - e^{-x}$ we bound $\sup_{f \in F} Ef - E_m f \le 2\varepsilon + t$. Using the covering number bound assumption and
the sublinearity of $\sqrt{\cdot}$, we have $\sup_{f \in F} Ef - E_m f \le 2\varepsilon + B\left(\sqrt{d\ln(C/\varepsilon)/2m} + \sqrt{x/2m}\right)$. The
proof is completed by taking $\varepsilon = 1/\sqrt{m}$.
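To get a feel for the magnitude of this bound, the following sketch evaluates the gap $2\varepsilon + B(\sqrt{d\ln(C/\varepsilon)/2m} + \sqrt{x/2m})$ at $\varepsilon = 1/\sqrt{m}$, exactly as in the last step of the proof. The function name and the sample values of $d$, $C$ and $x$ are our own illustrative assumptions.

```python
import math

def slow_rate_gap(d, m, x, B=1.0, C=math.e):
    """Gap 2*eps + B*(sqrt(d ln(C/eps)/(2m)) + sqrt(x/(2m))) at eps = 1/sqrt(m)."""
    eps = 1.0 / math.sqrt(m)
    cover_term = math.sqrt(d * math.log(C / eps) / (2.0 * m))
    confidence_term = math.sqrt(x / (2.0 * m))
    return 2.0 * eps + B * (cover_term + confidence_term)

# Illustrative values only: d = 50 parameters, confidence level x = ln(20).
gaps = [slow_rate_gap(d=50, m=m, x=math.log(20), C=4.0) for m in (10**3, 10**4, 10**5)]
print(gaps)
```

The gap shrinks roughly like $\sqrt{d\ln(m)/m}$, which is the slow rate that Proposition 22 improves on when the empirical error is small.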








4. On the Babel Function 


The Babel function is one of several metrics defined in the sparse representations literature to quantify
an "almost orthogonality" property that dictionaries may enjoy. Such properties have been
shown to imply theoretical properties such as uniqueness of the optimal k sparse representation. In
the algorithmic context, Donoho and Elad (2003) and Tropp (2004) use the Babel function to show
that particular efficient algorithms for finding sparse representations fulfill certain quality guarantees
when applied to such dictionaries. This reinforces the practical importance of the learnability
of this class of dictionaries. We proceed to discuss some elementary properties of the Babel function,
and then state a bound on the proportion of dictionaries having a sufficiently good Babel function.

Measures of orthogonality are typically defined in terms of inner products between the elements 
of the dictionary. Perhaps the simplest of these measures of orthogonality is the following special 
case of the Babel function. 


Definition 24 The coherence of a dictionary $D$ is $\mu_1(D) = \max_{i \ne j} |\langle d_i, d_j \rangle|$.


The proof of Proposition 3 demonstrates that the Babel function quantifies the effects of non-orthogonality
on the representation of a signal with a particular level $k+1$ of sparsity. Is it enough to bound
the Babel function using coherence? Only at a cost of significantly tightening our requirements on
dictionaries. While the coherence and Babel measures are indeed related by the inequalities


\[ \mu_1(D) \le \mu_k(D) \le k\mu_1(D), \]


the factor $k$ gap between the bounds cannot be improved. The tightness of the right inequality is
witnessed by a dictionary including $k+1$ copies of the same element. That of the left inequality
is witnessed by the following example. Let $D$ consist of $k$ pairs of elements, so that the subspace
spanned by each pair is orthogonal to all other elements, and such that the inner product between
the elements of any single pair is one half. In this case $\mu_k(D) = \mu_1(D) = 1/2$. However, note that
ensuring $\mu_k < 1$ by restricting only $\mu_1$ requires the constraint $\mu_1(D) < 1/k$, which is not fulfilled in our
example.
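Both extremal examples can be checked numerically. The sketch below assumes the usual definition of the Babel function, $\mu_k(D) = \max_i \max_{|\Lambda| = k,\, i \notin \Lambda} \sum_{\lambda \in \Lambda} |\langle d_i, d_\lambda \rangle|$, which matches the quantity $W_i$ used in Appendix B; the helper names `babel` and `paired_dictionary` are ours.

```python
import math

def babel(D, k):
    """mu_k(D): max over i of the sum of the k largest |<d_i, d_j>|, j != i.

    D is a list of unit-norm dictionary elements (lists of floats).
    """
    best = 0.0
    for i, di in enumerate(D):
        inner = sorted(
            (abs(sum(a * b for a, b in zip(di, dj)))
             for j, dj in enumerate(D) if j != i),
            reverse=True,
        )
        best = max(best, sum(inner[:k]))
    return best

def paired_dictionary(k):
    """k mutually orthogonal pairs; the two elements of a pair have inner product 1/2."""
    D = []
    for t in range(k):
        u = [0.0] * (2 * k)
        v = [0.0] * (2 * k)
        u[2 * t] = 1.0
        v[2 * t] = 0.5
        v[2 * t + 1] = math.sqrt(3.0) / 2.0
        D.extend([u, v])
    return D

k = 4
pairs = paired_dictionary(k)
copies = [[1.0, 0.0]] * (k + 1)  # k + 1 copies of the same element

print(babel(pairs, 1), babel(pairs, k))    # equal values: the left inequality is tight
print(babel(copies, 1), babel(copies, k))  # second value is k times the first
```

Sorting the absolute inner products of each element and summing the $k$ largest avoids enumerating all subsets $\Lambda$, since the inner maximum is always attained by the $k$ largest terms.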

To better understand $\mu_k(D)$, we consider first its extreme values. When $\mu_k(D) = 0$, for any
$k \ge 1$, this means that $D$ is an orthogonal set (therefore $p \le n$). The maximal value $\mu_k(D) = k$ we
have seen before.
A well known generic class of dictionaries with more elements than a basis is that of frames (see
Duffin and Schaeffer, 1952), which includes many wavelet systems and filter banks. Some frames
can be trivially seen to fulfill our condition on the Babel function.


Proposition 25 Let $D \in R^{n \times p}$ be a frame of $R^n$, so that for every $v \in S^{n-1}$ we have that
$\sum_{i=1}^{p} |\langle v, d_i \rangle|^2 \le B$, with $\|d_i\|_2 = 1$ for all $i$, and $B < 1 + 1/k$. Then $\mu_{k-1}(D) < 1$.


This may be easily verified by considering the inner products of any dictionary element with
any other $k$ elements as a vector in $R^k$; the frame condition bounds its squared Euclidean norm by
$B - 1$ (we remove the inner product of the element with itself in the frame expression). Then use
the equivalence of $l_1$ and $l_2$ norms.
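Written out, the argument is the following two-step estimate. Here $v \in R^k$ collects the inner products of a fixed element $d_i$ with any $k$ other elements (notation introduced only for this expansion), and the frame condition is applied to the unit vector $d_i$ itself, whose inner product with itself contributes 1.

```latex
\[
\|v\|_2^2 \;\le\; \sum_{j \ne i} |\langle d_i, d_j \rangle|^2 \;\le\; B - 1 \;<\; \frac{1}{k},
\qquad
\|v\|_1 \;\le\; \sqrt{k}\,\|v\|_2 \;<\; \sqrt{k} \cdot \frac{1}{\sqrt{k}} \;=\; 1 .
\]
```

Since this holds for every element and every choice of $k$ others, the Babel function of order $k$, and hence also of order $k-1$, is below 1.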


4.1 Proportion of Dictionaries with $\mu_{k-1}(D) < \delta$


We return to the question of the prevalence of dictionaries having $\mu_{k-1} < \delta$. Are almost all dictionaries
such? If the answer is affirmative, it implies that Theorem 11 is quite strong, and representation
finding algorithms such as basis pursuit are almost always exact, which might help prove properties
of dictionary learning algorithms. If the opposite is true and few dictionaries have low Babel
function, the results of this paper are weak. While there might be better probability measures on the
space of dictionaries, we consider one that seems natural: suppose that a dictionary $D$ is constructed
by choosing $p$ unit vectors uniformly from $S^{n-1}$; what is the probability that $\mu_{k-1}(D) < \delta$? How
does this depend on $p$ and $k$?

Theorem 5 gives us the following answer to these questions. Asymptotically almost all dictionaries
under the uniform measure are learnable with $O(np)$ examples, as long as $k \ln p = o(\sqrt{n})$.


5. Dictionary Learning in Feature Spaces 


We propose in Section 2 a scenario in which dictionary learning is performed in a feature space
corresponding to a kernel function. Here we show how to adapt the different generalization bounds
discussed in this paper for the particular case of $R^n$ to more general feature spaces, and discuss the
dependence of the sample complexities on the properties of the kernel function or the corresponding
feature mapping. We begin with the relevant specialization of the results of Maurer and Pontil
(2010) which have the simplest dependence on the kernel, and then discuss the extensions to k
sparse representation and to the cover number techniques presented in the current work.

A general feature space, denoted $H$, is a Hilbert space to which Theorem 6 applies as is, under
the simple assumption that the dictionary elements and signals are in its unit ball; this assumption
is guaranteed by some kernels such as the Gaussian kernel. Then we take $\nu$ on the unit ball of $H$ to
be induced by some distribution $\nu'$ on the domain of the kernel, and the theorem applies to any such
$\nu'$ on $\mathcal{R}$. Nothing more is required if the representation is chosen from $R_\lambda$. The corresponding
generalization bound for k sparse representations when the dictionary elements are nearly orthogonal
in the feature space is given in Proposition 13.

Proof (Of Proposition 13) Proposition 3 applies with the Euclidean norm of $H$, and $\gamma = 1$. We apply
Theorem 6 with $\lambda = k/(1 - \delta)$. ∎


The results so far show that generalization in dictionary learning can occur despite the potentially
infinite dimension of the feature space, without considering practical issues of representation
and computation. We now make the domain and applications of the kernel explicit in order to
address a basic computational question, and allow the use of cover number based generalization
bounds to prove Theorem 14. We now consider signals represented in a metric space $(\mathcal{R}, d)$, in
which similarity is measured by the kernel $\kappa$ corresponding to the feature map $\phi : \mathcal{R} \to H$. The
elements of a dictionary $D$ are now from $\mathcal{R}$, and we denote by $\Phi D$ their mapping by $\phi$ to $H$. The
representation error function used is $h_{\phi,A,D}$.

We now show that the approximation error in the feature space is a quadratic function of the 
coefficient vector; the quadratic function for particular D and x may be found by applications of the 
kernel. 


Proposition 26 Computing the representation error at a given $x, a, D$ requires $O(p^2)$ kernel applications
in general, and only $O(k^2 + p)$ when $a$ is $k$ sparse.


The squared error expands to


\[ \sum_{i=1}^{p} a_i \sum_{j=1}^{p} a_j \kappa(d_i, d_j) + \kappa(x, x) - 2 \sum_{i=1}^{p} a_i \kappa(x, d_i). \]

∎
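The expansion translates directly into code that touches the data only through kernel evaluations. The sketch below is a minimal illustration, not the authors' implementation: the function names are ours, and a linear kernel is assumed for the check only because its feature map is the identity, so the result can be compared with the ordinary squared Euclidean error.

```python
def kernel_sq_error(kernel, D, x, a):
    """Squared feature-space error ||(Phi D) a - phi(x)||^2 via kernel calls only.

    Only the nonzero coefficients of `a` are visited, so a k sparse `a`
    costs k^2 + k + 1 kernel applications.
    """
    support = [i for i, ai in enumerate(a) if ai != 0.0]
    quad = sum(a[i] * a[j] * kernel(D[i], D[j]) for i in support for j in support)
    cross = sum(a[i] * kernel(x, D[i]) for i in support)
    return quad + kernel(x, x) - 2.0 * cross

def linear_kernel(u, v):
    # Assumed test kernel: its feature map is the identity.
    return sum(ui * vi for ui, vi in zip(u, v))

D = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
x = [0.5, 0.5]
a = [0.25, 0.0, 0.5]  # a 2 sparse coefficient vector

# Direct computation of ||D a - x||^2 for comparison.
residual = [sum(a[i] * D[i][t] for i in range(3)) - x[t] for t in range(2)]
direct = sum(r * r for r in residual)
kernelized = kernel_sq_error(linear_kernel, D, x, a)
print(kernelized, direct)
```

Restricting the double sum to the support of $a$ is what reduces the quadratic term from $p^2$ to $k^2$ kernel applications.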


We note that the k sparsity constraint on a poses algorithmic difficulties beyond those addressed 
here. Some of the common approaches to these, such as orthogonal matching pursuit (Chen et al., 
1989), also depend on the data only through their inner products, and may therefore be adapted to 
the kernel setting. 

The cover number bounds depend strongly on the dimension of the space of dictionary elements. 
Taking H as the space of dictionary elements is the simplest approach, but may lead to vacuous 
or weak bounds, for example in the case of the Gaussian kernel whose feature space is infinite 
dimensional. Instead we propose to use the space of data representations R, whose dimensions are 
generally bounded by practical considerations. In addition, we will assume that the kernel is not 
“too wild” in the following sense. 


Definition 27 Let $L, \alpha > 0$, and let $(A, d')$ and $(B, d)$ be metric spaces. We say a mapping $f : A \to B$
is uniformly $L$ Hölder of order $\alpha$ on a set $S \subseteq A$ if $\forall x, y \in S$, the following bound holds:


\[ d(f(x), f(y)) \le L \cdot d'(x, y)^{\alpha}. \]

The relevance of this smoothness condition is as follows.


Lemma 28 A Hölder function maps an $\varepsilon$ cover of $S$ to an $L\varepsilon^{\alpha}$ cover of its image $f(S)$. Thus, to
obtain an $\varepsilon$ cover of the image of $S$, it is enough to begin with an $(\varepsilon/L)^{1/\alpha}$ cover of $S$.


A Hölder feature map $\phi$ allows us to bound the cover numbers of the dictionary elements in $H$
using their cover number bounds in $\mathcal{R}$. Note that not every kernel corresponds to a Hölder feature
map (the Dirac $\delta$ kernel is a counter example: any two distinct elements are mapped to elements at a
mutual distance of 1), and sometimes analyzing the feature map is harder than analyzing the kernel.
The following lemma bounds the geometry of the feature map using that of the kernel.


Lemma 29 Let $\kappa(x, y) = \langle \phi(x), \phi(y) \rangle$, and assume further that $\kappa$ fulfills a Hölder condition of order
$\alpha$ uniformly in each parameter, that is, $|\kappa(x, y) - \kappa(x + h, y)| \le L\|h\|^{\alpha}$. Then $\phi$ uniformly fulfills a
Hölder condition of order $\alpha/2$ with constant $\sqrt{2L}$.

This result is not sharp. For example, for the Gaussian case, both the kernel and the feature map are
Hölder of order 1.

Proof Using the Hölder condition, we have that $\|\phi(x) - \phi(y)\|_H^2 = \kappa(x, x) - \kappa(x, y) + \kappa(y, y) -
\kappa(x, y) \le 2L\|x - y\|^{\alpha}$. All that remains is to take the square root of both sides. ∎
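Both the lemma and the remark on sharpness can be checked numerically for the Gaussian kernel. The sketch below assumes a unit-bandwidth kernel $\kappa(x, y) = \exp(-\|x - y\|^2/2)$, which depends only on the distance $r = \|x - y\|$ and is Lipschitz in each parameter (order $\alpha = 1$) with constant $L = e^{-1/2}$; these specific choices are ours, made for illustration.

```python
import math

def gaussian_kernel(r):
    """Gaussian kernel as a function of the distance r = ||x - y|| (unit bandwidth assumed)."""
    return math.exp(-r * r / 2.0)

L = math.exp(-0.5)  # Lipschitz constant of r -> exp(-r^2 / 2), attained at r = 1

worst_lemma, worst_sharp = 0.0, 0.0
for step in range(1, 5001):
    r = step / 1000.0                                      # distances in (0, 5]
    feat_dist = math.sqrt(2.0 - 2.0 * gaussian_kernel(r))  # ||phi(x) - phi(y)||
    # Lemma 29 with alpha = 1: feature distance <= sqrt(2 L) * r^(1/2).
    worst_lemma = max(worst_lemma, feat_dist / math.sqrt(2.0 * L * r))
    # Sharper fact for this kernel: feature distance <= r (order 1, constant 1).
    worst_sharp = max(worst_sharp, feat_dist / r)

print(worst_lemma <= 1.0, worst_sharp <= 1.0)
```

Both ratios stay at or below 1 on the grid: the order $1/2$ bound of the lemma holds, while the feature map in fact satisfies an order 1 condition, as the remark states.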


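For a given feature mapping $\phi$ and set of representations $\mathcal{R}$, we define two families of function
classes as follows:


\[ W_{\phi,\lambda} = \{ h_{\phi,R_\lambda,D} : D \in \mathcal{D} \} \quad \text{and} \quad
Q_{\phi,k,\delta} = \{ h_{\phi,H_k,D} : D \in \mathcal{D} \wedge \mu_{k-1}(\Phi D) \le \delta \}. \]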


The next proposition completes this section by giving the cover number bounds for the repre- 
sentation error function classes induced by appropriate kernels, from which various generalization 
bounds easily follow, such as Theorem 14. 


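Proposition 30 Let $\mathcal{R}$ be a set of representations with a cover number bound of $(C/\varepsilon)^n$, and let
either $\phi$ be uniformly $L$ Hölder of order $\alpha$ on $\mathcal{R}$, or $\kappa$ be uniformly $L$ Hölder of order $2\alpha$ on
$\mathcal{R}$ in each parameter, and let $\gamma = \sup_{d \in \mathcal{R}} \|\phi(d)\|_H$. Then the function classes $W_{\phi,\lambda}$ and $Q_{\phi,k,\delta}$, taken
as metric spaces with the supremum norm, have $\varepsilon$ covers of cardinalities at most

\[ \left( C \left( \lambda\gamma L/\varepsilon \right)^{1/\alpha} \right)^{np}
\quad \text{and} \quad
\left( C \left( k\gamma^2 L/(\varepsilon(1-\delta)) \right)^{1/\alpha} \right)^{np} \]

respectively.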


Proof We first consider the case of $l_1$ constrained coefficients. If $\|a\|_1 \le \lambda$ and $\max_{d \in \mathcal{D}} \|\phi(d)\|_H \le
\gamma$ then by considerations applied in Section 3, to obtain an $\varepsilon$ cover of the image of dictionaries
$\{\min_a \|(\Phi D)a - \phi(x)\|_H : D \in \mathcal{D}\}$, it is enough to obtain an $\varepsilon/(\lambda\gamma)$ cover of $\{\Phi D : D \in \mathcal{D}\}$. If
also the feature mapping $\phi$ is uniformly $L$ Hölder of order $\alpha$ over $\mathcal{R}$, then a $(\lambda\gamma L/\varepsilon)^{-1/\alpha}$ cover
of the set of dictionaries is sufficient, which as we have seen requires at most $\left( C (\lambda\gamma L/\varepsilon)^{1/\alpha} \right)^{np}$
elements.
In the case of $l_0$ constrained representation, the bound on $\lambda$ due to Proposition 3 is $\gamma k/(1 - \delta)$,
and the result follows from the above by substitution. ∎


6. Conclusions 


Our work has several implications on the design of dictionary learning algorithms as used in signal, 
image, and natural language processing. First, the fact that generalization is only logarithmically 
dependent on the $l_1$ norm of the coefficient vector widens the set of applicable approaches to
penalization. Second, in the particular case of k sparse representation, we have shown that the Babel
function is a key property for the generalization of dictionaries. It might thus be useful to modify 
dictionary learning algorithms so that they obtain dictionaries with low Babel functions, possibly 
through regularization or through certain convex relaxations. Third, mistake bounds (e.g., Mairal 
et al. 2010) on the quality of the solution to the coefficient finding optimization problem may lead to 
generalization bounds for practical algorithms, by tying such algorithms to k sparse representation. 
The upper bounds presented here invite complementary lower bounds. The existing lower
bounds for $k = 1$ (vector quantization) and for $k = p$ (representation using PCA directions) are
applicable, but do not capture the geometry of general k sparse representation, and in particular
do not clarify the effective dimension of the unrestricted class of dictionaries for it. We have not
excluded the possibility that the class of unrestricted dictionaries has the same dimension as that of
those with a small Babel function. The best upper bound we know for the larger class, being the
trivial one of order $O\left(\binom{p}{k} n^2 / m\right)$, leaves a significant gap for future exploration.

We view the dependence on $\mu_{k-1}$ from an "algorithmic luckiness" perspective (Herbrich and
Williamson, 2003): if the data are described by a dictionary with low Babel function, the generalization
bounds are encouraging.
Appendix A. Generalization with Fast Rates


In this appendix we give a proof, essentially due to Andreas Maurer, of the fast rates result
Proposition 22. The assumption of $l_\infty$ cover numbers allows a much simpler argument than that in the more
general results by Mendelson (2003) and Bartlett et al. (2005), which also leads to better constants
for this case.

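Proof (Of Proposition 22) We take $G$ to be a $1/m$ cover of $F$ as guaranteed by the assumption. Then
for any $f \in F$, there exists $g \in G$ such that $\|f - g\|_\infty \le 1/m$, and Lemmas 31 and 33 apply. We have
with probability at least $1 - \exp(-x)$, for every $f \in F$:

\begin{align*}
Ef - E_m f &\le \left(Eg + \frac{1}{m}\right) - \left(E_m g - \frac{1}{m}\right) \tag{4} \\
&= \frac{2}{m} + Eg - E_m g \\
&\le \frac{2}{m} + \sqrt{\frac{2 \operatorname{Var} g \,(d\ln(Cm) + x)}{m}} + \frac{2(d\ln(Cm) + x)}{3m} \tag{5} \\
&\le \frac{2}{m} + \left(\sqrt{\operatorname{Var} f} + \frac{2}{m}\right) \sqrt{\frac{2(d\ln(Cm) + x)}{m}} + \frac{2(d\ln(Cm) + x)}{3m}. \tag{6}
\end{align*}

Inequality (4) follows from Lemma 33 and

\[ Ef \le Eg + \frac{1}{m} \quad \text{and} \quad E_m f \ge E_m g - \frac{1}{m}. \]

Inequality (5) follows from Lemma 31:

\[ \Pr\left( \exists g \in G : Eg > E_m g + \sqrt{\frac{2 \operatorname{Var} g \,(d\ln(Cm) + x)}{m}} + \frac{2(d\ln(Cm) + x)}{3m} \right) \le e^{-x}. \]

Inequality (6) follows from Lemma 33 because

\[ \sqrt{\frac{2 \operatorname{Var} g \,(d\ln(Cm) + x)}{m}} = \sqrt{\operatorname{Var} g}\, \sqrt{\frac{2(d\ln(Cm) + x)}{m}} \le \left( \sqrt{\operatorname{Var} f} + \frac{2}{m} \right) \sqrt{\frac{2(d\ln(Cm) + x)}{m}}. \]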

















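After slight rearrangement, we have

\begin{align*}
Ef - E_m f &\le \sqrt{\frac{2 \operatorname{Var} f \,(d\ln(Cm) + x)}{m}} + \left(\frac{9}{\sqrt{m}} + 2\right) \frac{d\ln(Cm) + x}{3m} + \frac{2}{m} \tag{7} \\
&\le \sqrt{\frac{2 Ef \,(d\ln(Cm) + x)}{m}} + \left(\frac{9}{\sqrt{m}} + 2\right) \frac{d\ln(Cm) + x}{3m} + \frac{2}{m} \tag{8} \\
&\le \sqrt{\frac{2 Ef \,(d\ln(Cm) + x)}{m}} + \left(\frac{9}{\sqrt{m}} + 2\right) \left(\frac{d+3}{3d}\right) \frac{d\ln(Cm) + x}{m}. \tag{9}
\end{align*}

Simple algebra, the fact that $\operatorname{Var} f \le Ef$ for a $[0,1]$ valued function $f$, and Lemma 37 respectively
justify inequalities (7), (8) and (9).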
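For convenience, we denote $K = \left(\frac{9}{\sqrt{m}} + 2\right)\left(\frac{d+3}{3d}\right)$. We also denote $A = E_m f + K\,\frac{d\ln(Cm) + x}{m}$ and
$B = (d\ln(Cm) + x)/m$, and note we have shown that with probability at least $1 - \exp(-x)$ we have
$Ef - A \le \sqrt{2B\,Ef}$, which by Lemma 34 implies $Ef \le A + B + \sqrt{2AB + B^2}$. By substitution and
Lemma 36 we conclude that then:

\begin{align*}
Ef &\le A + B + \sqrt{2AB + B^2} \\
&= E_m f + K\,\frac{d\ln(Cm) + x}{m} + B + \sqrt{2B\,E_m f + 2BK\,\frac{d\ln(Cm) + x}{m} + B^2} \\
&\le E_m f + \sqrt{\frac{2 E_m f \,(d\ln(Cm) + x)}{m}} + \left(\sqrt{2K + 1} + K + 1\right) \frac{d\ln(Cm) + x}{m},
\end{align*}

using Lemma 36 for the second inequality.
From Lemma 35 with $a = E_m f$ and $b = (d\ln(Cm) + x)/m$ we find that for every $\beta > 0$

\[ \sqrt{2 E_m f \,(d\ln(Cm) + x)/m} \le \beta E_m f + \frac{1}{2\beta}\,(d\ln(Cm) + x)/m \]

and the proposition follows. ∎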


The following lemma encapsulates the probabilistic part of the analysis. 


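Lemma 31 Let $G$ be a class of $[0,1]$ functions, of finite cardinality $|G| \le (Cm)^d$. Then

\[ \Pr\left( \exists g \in G : Eg > E_m g + \sqrt{\frac{2 \operatorname{Var} g \,(d\ln(Cm) + x)}{m}} + \frac{2(d\ln(Cm) + x)}{3m} \right) \le e^{-x}. \]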
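In proving Lemma 31, we use the following well known fact, which we recall for its notations.


Lemma 32 (Bernstein Inequality) Let $X_i$ be independent zero mean variables with $|X_i| \le c$ almost
surely. Then

\[ \Pr\left( \frac{1}{m} \sum_{i=1}^{m} X_i > \varepsilon \right) \le \exp\left( -\frac{m\varepsilon^2}{2\sigma^2 + 2c\varepsilon/3} \right), \]

where $\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \operatorname{Var}(X_i)$.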


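Proof (Of Lemma 31) Denote $X_i = Eg - g(s_i)$, where $\{s_i\}_{i=1}^{m}$ is a set of IID random variables. We
recall the notation $E_m g = (1/m) \sum_{i=1}^{m} g(s_i)$; then $\sum_{i=1}^{m} X_i = \sum_{i=1}^{m} (Eg - g(s_i)) = mEg - \sum_{i=1}^{m} g(s_i) =
m(Eg - E_m g)$, so that $\frac{1}{m} \sum_{i=1}^{m} X_i = Eg - E_m g$.

Using the fact that our $X_i$ are IID and the translation invariance of variance, we have

\begin{align*}
\sigma^2 &= \frac{1}{m} \sum_{i=1}^{m} \operatorname{Var}(X_i) \\
&= \operatorname{Var}(X_i) \\
&= \operatorname{Var}(Eg - g(s_i)) \\
&= \operatorname{Var} g.
\end{align*}

Since $\|g\|_\infty \le 1$, we also know $|X_i| \le 1$.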


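Applying the Bernstein Inequality we get $\Pr(Eg - E_m g > \varepsilon) \le \exp\left(-\frac{m\varepsilon^2}{2 \operatorname{Var} g + 2\varepsilon/3}\right)$ for any $\varepsilon >
0$. We wish to bound the probability of a large deviation by $\exp(-y)$, so it is enough for $\varepsilon$ to satisfy:

\begin{align*}
\exp\left(-\frac{m\varepsilon^2}{2 \operatorname{Var} g + 2\varepsilon/3}\right) \le \exp(-y)
&\iff -\frac{m\varepsilon^2}{2 \operatorname{Var} g + 2\varepsilon/3} \le -y \\
&\iff y \le \frac{m\varepsilon^2}{2 \operatorname{Var} g + 2\varepsilon/3} \\
&\iff y \left(2 \operatorname{Var} g + \frac{2\varepsilon}{3}\right) \le m\varepsilon^2 \\
&\iff 0 \le \varepsilon^2 - \frac{2y}{3m}\,\varepsilon - \frac{2y \operatorname{Var} g}{m}.
\end{align*}

This quadratic inequality in $\varepsilon$ has the roots $\left(2y/(3m) \pm \sqrt{(2y/(3m))^2 + 8y \operatorname{Var} g/m}\right)/2$ and
a positive coefficient for $\varepsilon^2$, so we require $\varepsilon$ to not be between the roots. The root closer to $-\infty$
is never positive because $\sqrt{(2y/(3m))^2 + 8y \operatorname{Var} g/m} \ge \sqrt{(2y/(3m))^2} = 2y/(3m)$, but the other
is always strictly positive, so it is enough to take $\varepsilon$ greater than both. In particular, by Lemma 36,
we may choose $\varepsilon = 2y/(3m) + \sqrt{2y \operatorname{Var} g/m} \ge \left(2y/(3m) + \sqrt{(2y/(3m))^2 + 8y \operatorname{Var} g/m}\right)/2$, and
conclude that

\[ \Pr\left( Eg - E_m g > \frac{2y}{3m} + \sqrt{\frac{2y \operatorname{Var} g}{m}} \right) \le \exp(-y). \]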
3m m 
Taking a union bound over all $|G|$ functions, we have:

\begin{align*}
\Pr\left( \exists g \in G : Eg - E_m g > \frac{2y}{3m} + \sqrt{\frac{2y \operatorname{Var} g}{m}} \right) &\le |G| \exp(-y) \\
&= \exp(\ln|G| - y) \\
&\le \exp\left(\ln(Cm)^d - y\right).
\end{align*}

Then we take $-x = \ln(Cm)^d - y \iff y = \ln(Cm)^d + x$ and have:

\[ \Pr\left( \exists g \in G : Eg - E_m g > \frac{2(d\ln(Cm) + x)}{3m} + \sqrt{\frac{2 \operatorname{Var} g \,(d\ln(Cm) + x)}{m}} \right) \le \exp(-x). \]

∎
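The choice of $\varepsilon$ made in this proof can be verified numerically: it should dominate the larger root of the quadratic, and therefore make the Bernstein exponent at least $y$. The sketch below does this with $c = 1$ for a few variance values; the function names and the sample values of $m$, $d$, $C$ and $x$ are illustrative assumptions of ours.

```python
import math

def bernstein_exponent(eps, m, var):
    """Exponent m eps^2 / (2 Var g + 2 eps / 3) of the Bernstein bound with c = 1."""
    return m * eps * eps / (2.0 * var + 2.0 * eps / 3.0)

def deviation(y, m, var):
    """The choice eps = 2y/(3m) + sqrt(2 y Var g / m) made in the proof."""
    return 2.0 * y / (3.0 * m) + math.sqrt(2.0 * y * var / m)

def larger_root(y, m, var):
    """Larger root of eps^2 - (2y/(3m)) eps - 2 y Var g / m = 0."""
    b = 2.0 * y / (3.0 * m)
    return (b + math.sqrt(b * b + 8.0 * y * var / m)) / 2.0

m, d, C, x = 5000, 20, math.e, 3.0
y = d * math.log(C * m) + x  # y = ln (Cm)^d + x, as in the union bound

checks = []
for var in (0.01, 0.1, 0.25):
    eps = deviation(y, m, var)
    checks.append((eps >= larger_root(y, m, var), bernstein_exponent(eps, m, var) >= y))
print(checks)
```

The chosen $\varepsilon$ is slightly larger than the exact root; this looseness is the price of the simpler closed form that feeds into inequality (5).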
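Lemma 33 Let $\|f - g\|_\infty \le \varepsilon$. Then under any distribution we have $|Ef - Eg| \le \varepsilon$, and $\sqrt{\operatorname{Var} g} -
\sqrt{\operatorname{Var} f} \le 2\varepsilon$.
Proof The first part is clear. For the second, we need mostly the triangle inequality for norms:

\begin{align*}
\sqrt{\operatorname{Var} f} - \sqrt{\operatorname{Var} g} &= \sqrt{E(f - Ef)^2} - \sqrt{E(g - Eg)^2} \\
&= \|f - Ef\|_{L_2} - \|g - Eg\|_{L_2} \\
&\le \|f - Ef - g + Eg\|_{L_2} \\
&\le \|f - g\|_{L_2} + \|Ef - Eg\|_{L_2} \\
&\le \|f - g\|_\infty + \|E(f - g)\|_\infty \\
&\le 2\|f - g\|_\infty \\
&\le 2\varepsilon.
\end{align*}

The same argument with the roles of $f$ and $g$ exchanged gives the stated direction. ∎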


Lemma 34 If $A, B \ge 0$ and $Ef - A \le \sqrt{2EfB}$ then $Ef \le A + B + \sqrt{2AB + B^2}$.
Proof First note that if $Ef < A$, we are done, because $A, B \ge 0$; thus we assume $Ef \ge A$. Squaring
both sides of $Ef - A \le \sqrt{2EfB}$ we find

\begin{align*}
(Ef - A)^2 \le 2EfB &\iff (Ef)^2 - 2EfA + A^2 \le 2EfB \\
&\iff (Ef)^2 - 2EfA - 2EfB \le -A^2 \\
&\iff (Ef)^2 - 2EfA - 2EfB + (A + B)^2 \le -A^2 + (A + B)^2 \\
&\iff (Ef)^2 - 2Ef(A + B) + (A + B)^2 \le -A^2 + (A + B)^2 \\
&\iff (Ef - (A + B))^2 \le -A^2 + (A + B)^2 \\
&\implies Ef - (A + B) \le \sqrt{(A + B)^2 - A^2} \quad (\sqrt{\cdot} \text{ of non-negative expressions}) \\
&\iff Ef \le (A + B) + \sqrt{2AB + B^2}.
\end{align*}

∎
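Lemma 34 is tight, in the sense that the largest value of $Ef$ compatible with the hypothesis is exactly the stated bound. The following sketch illustrates this by scanning a grid of candidate values for one illustrative pair $A, B$ (the values and function names are our own choices).

```python
import math

def lemma34_bound(A, B):
    """Right-hand side A + B + sqrt(2AB + B^2) of Lemma 34."""
    return A + B + math.sqrt(2.0 * A * B + B * B)

def satisfies_hypothesis(E, A, B):
    """Hypothesis E - A <= sqrt(2 E B) of Lemma 34, for a value E >= 0 of Ef."""
    return E - A <= math.sqrt(2.0 * E * B)

A, B = 0.3, 0.05
bound = lemma34_bound(A, B)

# Scan candidate values of Ef in [0, 2]; every feasible value obeys the bound.
grid = [i / 10000.0 for i in range(0, 20001)]
feasible = [E for E in grid if satisfies_hypothesis(E, A, B)]
largest_feasible = max(feasible)
print(largest_feasible, bound)
```

The largest feasible grid point sits within one grid step of the bound, which is what one expects when the inequality of the lemma holds with equality at the boundary.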
We omit the easy proofs of the next two lemmata. 
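Lemma 35 For $\beta > 0$, $\sqrt{2ab} \le \beta a + \frac{b}{2\beta}$.
Lemma 36 For any $a, b \ge 0$, $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$.
Lemma 37 For $d, m \ge 1$, $x \ge 0$ and $C \ge e$ we have

\[ \left(\frac{9}{\sqrt{m}} + 2\right) \frac{d\ln(Cm) + x}{3m} + \frac{2}{m} \le \left(\frac{9}{\sqrt{m}} + 2\right) \left(\frac{d+3}{3d}\right) \frac{d\ln(Cm) + x}{m}. \]

Proof By the assumptions, $\frac{9}{\sqrt{m}} + 2 \ge 2$ (fact (a)) and $d \le d\ln(Cm) + x$ (fact (b)). Then

\begin{align*}
\left(\frac{9}{\sqrt{m}} + 2\right) \frac{d\ln(Cm) + x}{3m} + \frac{2}{m}
&\overset{(a)}{\le} \left(\frac{9}{\sqrt{m}} + 2\right) \frac{d\ln(Cm) + x}{3m} + \left(\frac{9}{\sqrt{m}} + 2\right) \frac{1}{m} \\
&= \left(\frac{9}{\sqrt{m}} + 2\right) \frac{d\ln(Cm) + x + 3}{3m} \\
&= \left(\frac{9}{\sqrt{m}} + 2\right) \frac{d\ln(Cm) + x + \frac{3}{d}\,d}{3m} \\
&\overset{(b)}{\le} \left(\frac{9}{\sqrt{m}} + 2\right) \frac{d\ln(Cm) + x + \frac{3}{d}\,(d\ln(Cm) + x)}{3m} \\
&= \left(\frac{9}{\sqrt{m}} + 2\right) \left(\frac{d+3}{3d}\right) \frac{d\ln(Cm) + x}{m}.
\end{align*}

∎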
Appendix B. Proof of Theorem 5


In the proof we will use an isoperimetric inequality about the sphere in high dimensions. 
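Definition 38 The $\varepsilon$ expansion of a set $S$ in a metric space $(X, d)$ is defined as
\[ S_\varepsilon = \{x \in X \mid d(x, S) \le \varepsilon\}, \]
where $d(x, A) = \inf_{a \in A} d(x, a)$.
Lemma 39 (Lévy's isoperimetric inequality 1951) Let $C$ be one half of $S^{n-1}$, then

\[ \mu\left(S^{n-1} \setminus C_\varepsilon\right) \le \sqrt{\frac{\pi}{8}} \exp\left(-\frac{(n-2)\varepsilon^2}{2}\right). \]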


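Proof (Of Theorem 5) For any $p \in N$ we denote $[p] = \{1, \ldots, p\}$ and for $i \in [p]$, we define $W_i =
\max_{\Lambda \subset [p] \setminus \{i\}, |\Lambda| = k} \sum_{\lambda \in \Lambda} |\langle d_i, d_\lambda \rangle|$. Then it is enough to prove that

\[ P(\exists i \in [p] : W_i > \delta) \le \sqrt{\frac{\pi}{2}}\, p(p-1) \exp\left(-(n-2)\left(\frac{\delta}{k}\right)^2 / 2\right). \]

The $W_i$ are identically distributed variables, so by a union bound, $P(\exists i \in [p] : W_i > \delta) \le
pP(W_1 > \delta)$.
By definition, $P(W_1 > \delta) = P\left(\max_{\Lambda \subset [p] \setminus \{1\}, |\Lambda| = k} \sum_{j \in \Lambda} |\langle d_1, d_j \rangle| > \delta\right)$ and since $\sum_{j \in \Lambda} |\langle d_1, d_j \rangle| \le
k \max_{j \in \Lambda} |\langle d_1, d_j \rangle|$ always,

\begin{align*}
P(W_1 > \delta) &\le P\left(k \max_{j \ne 1} |\langle d_1, d_j \rangle| > \delta\right) \\
&= P\left(\max_{j \ne 1} |\langle d_1, d_j \rangle| > \frac{\delta}{k}\right).
\end{align*}

Note that $\max_{j \ne 1} |\langle d_1, d_j \rangle| > \delta/k \iff \exists j \in [p] \setminus \{1\} : |\langle d_1, d_j \rangle| > \delta/k$. Noting the random variables
$|\langle d_1, d_j \rangle|$ are identically distributed, and using a union bound on the choice of $j$, we have
$P(W_1 > \delta) \le (p-1) P(|\langle d_1, d_2 \rangle| > \delta/k)$.

Since $\langle d_1, d_2 \rangle$ is invariant to applying to $d_1$ and $d_2$ the same orthogonal transformation,
we may assume without loss of generality that $d_2 = e_1$, and with another union bound note that
$P(|\langle d_1, d_2 \rangle| > \delta/k) = P(|\langle e_1, d_1 \rangle| \ge \delta/k) \le 2P(\langle e_1, d_1 \rangle \ge \delta/k)$.

The fraction $\delta/k$ is positive, so the set of $d_1$ on which $\langle e_1, d_1 \rangle < \delta/k$ holds includes the negative
half sphere, and any point within $\delta/k$ of it. Then by the isoperimetric inequality of Lemma 39,

\[ 2P(\langle e_1, d_1 \rangle \ge \delta/k) \le \sqrt{\frac{\pi}{2}} \exp\left(-(n-2)(\delta/k)^2 / 2\right). \]

The theorem results by substitution. ∎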
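The bound can be compared with a small simulation. The sketch below evaluates $\sqrt{\pi/2}\,p(p-1)\exp(-(n-2)(\delta/k)^2/2)$ and estimates the frequency of the event $\max_i W_i > \delta$ over dictionaries whose elements are drawn uniformly from the sphere (as normalized Gaussian vectors). The function names, the seed, and the values of $n$, $p$, $k$, $\delta$ and the number of trials are illustrative assumptions, not taken from the paper.

```python
import math
import random

def theorem5_bound(n, p, k, delta):
    """sqrt(pi/2) p (p-1) exp(-(n-2) (delta/k)^2 / 2), the bound proved above."""
    return (math.sqrt(math.pi / 2.0) * p * (p - 1)
            * math.exp(-(n - 2) * (delta / k) ** 2 / 2.0))

def random_unit_vector(n, rng):
    """Uniform sample from the unit sphere: a normalized standard Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def max_babel_row(D, k):
    """max_i W_i: the largest sum of k absolute inner products with other elements."""
    worst = 0.0
    for i, di in enumerate(D):
        inner = sorted((abs(sum(a * b for a, b in zip(di, dj)))
                        for j, dj in enumerate(D) if j != i), reverse=True)
        worst = max(worst, sum(inner[:k]))
    return worst

rng = random.Random(0)
n, p, k, delta, trials = 200, 10, 2, 0.7, 50
failures = sum(
    max_babel_row([random_unit_vector(n, rng) for _ in range(p)], k) > delta
    for _ in range(trials)
)
print(failures / trials, theorem5_bound(n, p, k, delta))
```

Because the proof replaces the sum of $k$ inner products by $k$ times the largest one and applies two union bounds, the observed frequency is expected to be well below the bound; the simulation is only meant to show how quickly the bound itself decays as $n$ grows for fixed $p$, $k$ and $\delta$.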